QTM 447 Lecture 21: Autoencoders and Generation

Kevin McAlister

April 1, 2025

\[ \newcommand\hbb{{\hat{\boldsymbol \beta}}} \newcommand\bb{{\boldsymbol \beta}} \newcommand\expn{{\frac{1}{N} \sum \limits_{i = 1}^N}} \newcommand\sumk{\sum \limits_{k = 1}^K} \newcommand\argminb{\underset{\bb}{\text{argmin }}} \newcommand\argmaxb{\underset{\bb}{\text{argmax }}} \newcommand\gtheta{\mathbf g(\boldsymbol \theta)} \newcommand\htheta{\mathbf H(\boldsymbol \theta)} \]

Generative Models

Goal: Come up with a strategy to learn \(P(\mathbf x)\) given a large set of inputs - \(\mathbf X\)

Success:

  • Density estimation: Given a proposed data point, \(\mathbf x_i\), what is the probability with which we could expect to see that data point? Don’t generate data points that have low probability of occurrence!

  • Sampling: How can we generate novel data from the model distribution? We should be able to sample from the distribution!

  • Representation: Can we learn meaningful feature representations from \(\mathbf x\)? Do we have the ability to exaggerate certain features?

All methods we’ll talk about can be sampled!

  • Differences in the ability to do the other 2

Autoregressive Generation

Goal: Come up with a strategy to learn \(P(\mathbf x)\) given a large set of inputs - \(\mathbf X\)

  • A collection of images

  • A collection of sentences

The easiest way to approach this is with a common probability identity.

Let \(\mathbf x\) be a vector of inputs - \([x_1,x_2,....,x_P]\)

The joint density over inputs is then:

\[ P(\mathbf x) = f(x_1,x_2,x_3,...,x_P) \]

Autoregressive Generation

The probability chain rule:

\[ f(x_1,x_2,x_3,...,x_P) = f(x_1)f(x_2 | x_1)f(x_3 | x_1,x_2)...f(x_P | x_{P-1}, x_{P-2},...) \]

If it is possible to learn the conditional density of the next input given the previous inputs, then we’ve learned the joint density of the inputs!
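A minimal sketch of the chain rule as a generative model, using an illustrative first-order Markov simplification over a two-symbol vocabulary (all probabilities below are made up):

```python
import numpy as np
from itertools import product

# Toy autoregressive model over a 2-symbol vocabulary {0, 1}:
# P(x) = P(x1) P(x2 | x1) P(x3 | x2) - a first-order Markov
# simplification of the full chain rule, for illustration.
p_first = np.array([0.6, 0.4])            # P(x1)
p_next = np.array([[0.7, 0.3],            # P(x_t | x_{t-1} = 0)
                   [0.2, 0.8]])           # P(x_t | x_{t-1} = 1)

def joint_prob(seq):
    """Joint probability of a sequence via the chain rule."""
    p = p_first[seq[0]]
    for prev, cur in zip(seq[:-1], seq[1:]):
        p *= p_next[prev, cur]
    return p

# The chain-rule factorization defines a proper joint density:
# probabilities over all length-3 sequences sum to 1.
total = sum(joint_prob(s) for s in product([0, 1], repeat=3))
print(total)
```

Sampling from the same model is just as direct: draw \(x_1\), then draw each \(x_t\) from its conditional given the previous symbol.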

Autoregressive Generation

Advancements in recurrent image generation:

  • Pixel CNN - exchange recurrence for CNN style windowing; a little faster

  • ImageGPT - exchange recurrence for masked self-attention; a little faster with really large training sets

These do a great job when we can compute them in finite time!

  • Explicit density that can easily sample new images

Autoregressive Generation

Painfully slow!!!

  • ImageGPT has only been shown to work for up to 64 x 64 input images

  • Literal days of training time to learn joint densities for small-ish data sets

  • Can’t be used generally to create a dictionary of all images!

Autoregressive Generation

What autoregressive methods do well:

  • Explicit density estimation - given a starting prompt/word, we can compute the joint probability of the next word directly

  • Fast at generation (for text)

What autoregressive methods don’t do well:

  • Images (too high dimensional)

  • No feature representation

Autoregressive Generation

What can we do to generate images?

  • We’ll need different machinery.

The rest of the class will be on image/video generation!

  • These methods can be used for text (or other 1d sequential data like sound waves)

  • Autoregressives are just the dominant generation method at this point in time!

Also, very useful for descriptive purposes!

PCA

A classic in the unsupervised literature is principal components analysis.

A more general statement:

Given a corpus of inputs, find a low dimensional representation of the inputs that minimizes reconstruction error

\[ \frac{1}{N} \sum \limits_{i = 1}^N \| \mathbf x_i - d(e(\mathbf x_i))\|^2_2 \]

  • \(e(\mathbf x_i) = \mathbf z_i\) is an encoder function that maps the input to a latent space

  • \(d(\mathbf z_i)\) is a decoder function that maps the latent variable back to original feature space

PCA

How we choose to find \(e()\) and \(d()\) dictates the method

PCA

Given a \(N \times P\) matrix of input features, find a single weight matrix, \(\mathbf W\), with column rank \(K\) that minimizes the objective function:

\[ \frac{1}{N} \sum \limits_{i = 1}^N \| \mathbf x_i - \mathbf W \mathbf W^T \mathbf x_i\|^2_2 \]

  • If the input features are correlated, then we don’t need to see them all individually to reconstruct the input

  • Find a lower dimensional representation of the input vector that retains most of the information

PCA

\(\mathbf W\) is a \(P \times K\) matrix of weights where \(K << P\) that controls both the mapping of the input to the latent space and the mapping of the latent space back to the input space

\(\mathbf z_i\) is a \(K\)-dimensional latent representation of the input that is derived through a linear mapping - \(\mathbf W^T \mathbf x_i\)

\[ \frac{1}{N} \sum \limits_{i = 1}^N \| \mathbf x_i - \mathbf W \mathbf z_i \|^2_2 \]

  • In theory, \(\mathbf z_i\) embeds the inputs in an interpretable latent space that combines all of the redundant info that we see in the input space!

Two additional constraints:

  • The columns of \(\mathbf W\) should be orthogonal directions

  • The columns of \(\mathbf W\) should be normalized

PCA

PCA operates by using the eigenvalues and eigenvectors of the square covariance matrix for the centered feature matrix:

\[ \mathbf X^T \mathbf X = \mathbf Q \mathbf D \mathbf Q^{-1} \]

where \(\mathbf Q\) is the collection of \(P\) eigenvectors ordered by eigenvalue (from large to small)

  • All of the eigenvalues will be nonnegative since the covariance matrix is positive semi-definite

To create a rank \(K\) approximation of the original feature matrix, set \(\mathbf W = \mathbf Q_{1:K}\)

  • Take the first \(K\) eigenvectors
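A minimal numpy sketch of this recipe on synthetic correlated inputs (dimensions and data are illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)

# Correlated synthetic inputs: N x P feature matrix, centered.
N, P, K = 500, 6, 2
Z_true = rng.normal(size=(N, K))
X = Z_true @ rng.normal(size=(K, P)) + 0.1 * rng.normal(size=(N, P))
X = X - X.mean(axis=0)

# Eigendecomposition of X^T X (eigh returns ascending order).
evals, evecs = np.linalg.eigh(X.T @ X)
order = np.argsort(evals)[::-1]          # sort large -> small
W = evecs[:, order[:K]]                  # first K eigenvectors (P x K)

Z = X @ W            # encode: z_i = W^T x_i
X_hat = Z @ W.T      # decode: x_hat_i = W z_i
err = np.mean(np.sum((X - X_hat) ** 2, axis=1))
print(err)
```

Because the inputs are (nearly) rank-2, two components reconstruct the data almost perfectly, and the columns of \(\mathbf W\) come out orthonormal for free.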

PCA

What’s neat about PCA is that it produces the best orthogonal rank-K approximation to the input under linearity and squared reconstruction error

  • No other method (under these constraints) will do better!

  • This optimality result follows from the Eckart–Young theorem

PCA

Why this construction?

  • It admits a nice analytical solution with good properties.

  • Can be solved via a basic computer that can find eigenvectors

Computational constraints led to the solution - no other particularly good reason!

PCA

Using PCA, we can:

  • Get rid of redundant features

  • Reduce dimensionality

  • Parse away noise

  • Pass latent scores, \(\mathbf Z\), to supervised learning procedures

PCA

Additionally, the dimensions of the latent space correspond to factors of variation within the training set!

  • Create a feature map for the original input data
  • Feature map corresponds to something about the inputs - perhaps a more abstract notion of features (curviness, thickness, etc.)

A key component of the data scientist’s toolkit!

PCA

Problem: Linearity and strictly invertible mappings are restrictive

Think of PCA as creating a mapping from the feature space to a latent space and then inverting the mapping to return an approximation to the original feature space

  • The latent space - in theory - includes all of the necessary information needed to reconstruct \(\mathbf X\) in a lower dimensional subspace

PCA seeks to solve the generic problem of reconstruction:

\[ \|\mathbf X - d(e(\mathbf X))\|^2_2 \]

  • Other than computational nicety, there’s no reason we can’t find a low dimensional reconstruction using nonlinear methods

Deterministic Autoencoders

This is referred to as a deterministic bottleneck autoencoder

  • Learn a set of encoder and decoder functions that map \(\mathbf X\) to itself!

  • Restriction: each input instance passes through a low-dimensional bottleneck ( \(K << P\) )

  • Can’t just learn itself, so needs to set up \(\mathbf Z\) to represent as much of the variation in \(\mathbf X\) as possible

Note that PCA is a special case!

  • Restrict the feedforward layers to have linear activations and be invertible!

No good reason to do this when we have autodiff that can solve for arbitrarily complex models
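A minimal numpy sketch of a bottleneck autoencoder with one nonlinear encoder layer, trained by hand-coded gradient descent (in practice you would use a deep learning framework with autodiff; the architecture, data, and learning rate here are all illustrative):

```python
import numpy as np

rng = np.random.default_rng(1)

# Synthetic correlated data: N x P inputs near a 2-d manifold.
N, P, K = 200, 8, 2
X = np.tanh(rng.normal(size=(N, K)) @ rng.normal(size=(K, P)))

# Encoder z = tanh(X W1 + b1), linear decoder X_hat = Z W2 + b2.
# Manual gradients stand in for a framework's autodiff.
W1 = 0.1 * rng.normal(size=(P, K)); b1 = np.zeros(K)
W2 = 0.1 * rng.normal(size=(K, P)); b2 = np.zeros(P)

def forward(X):
    Z = np.tanh(X @ W1 + b1)
    return Z, Z @ W2 + b2

def loss(X):
    _, X_hat = forward(X)
    return np.mean(np.sum((X - X_hat) ** 2, axis=1))

loss0 = loss(X)
lr = 0.05
for _ in range(500):
    Z, X_hat = forward(X)
    G = 2.0 * (X_hat - X) / N            # dL/dX_hat
    dW2 = Z.T @ G; db2 = G.sum(axis=0)
    dA = (G @ W2.T) * (1.0 - Z ** 2)     # backprop through tanh
    dW1 = X.T @ dA; db1 = dA.sum(axis=0)
    W1 -= lr * dW1; b1 -= lr * db1
    W2 -= lr * dW2; b2 -= lr * db2

print(loss0, loss(X))  # reconstruction error drops during training
```

Swapping the dense layers for a CNN backbone and letting autodiff handle the gradients gives the image autoencoders discussed above.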

Deterministic Autoencoders

How do you choose the size of the bottleneck?

  • This largely echoes the same discussion for PCA

  • Just choose a small number and see how well we’re able to recover the images

  • Stop adding dimensions when the next one doesn’t reduce the loss too much

  • Use validation error to choose - typically not needed since this would require serious tuning.

  • Common choices: 2 (to visualize the latent space directly) or 16/32/64/128/etc. for images
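The elbow heuristic above can be sketched with PCA reconstruction error as a stand-in for an autoencoder's loss (synthetic rank-3 data, purely illustrative):

```python
import numpy as np

rng = np.random.default_rng(2)

# PCA reconstruction error as a function of bottleneck size K.
# The same stop-when-the-drop-flattens heuristic applies to
# autoencoder bottlenecks.
N, P = 300, 10
X = (rng.normal(size=(N, 3)) @ rng.normal(size=(3, P))
     + 0.2 * rng.normal(size=(N, P)))
X = X - X.mean(axis=0)

evals, evecs = np.linalg.eigh(X.T @ X)
evecs = evecs[:, np.argsort(evals)[::-1]]   # large -> small

errors = []
for K in range(1, P + 1):
    W = evecs[:, :K]
    X_hat = X @ W @ W.T
    errors.append(np.mean(np.sum((X - X_hat) ** 2, axis=1)))
# errors should drop sharply up to K = 3 (the true rank), then flatten
print(errors)
```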

Deterministic Autoencoders

A few notes:

  • Everyone understands PCA, not necessarily autoencoders. Follow disciplinary norms.

  • For images, the CNN backbone helps to induce structure in the recovered images! The reconstruction loss is about the same as a deep feedforward autoencoder in a lot of situations, but it helps to find structure for recovery.

  • For images, autoencoders strictly dominate PCA!

Deterministic Autoencoders

What can we do with deterministic autoencoders?

Like transformers, the model can be split apart for different tasks.

Encoder ( \(\mathbf X \rightarrow \mathbf Z\) )

  • Use as you would for PCA. Pass lower dimensional representation to supervised model for better predictive performance.

  • Use to examine feature maps to understand sources of shared variation within input features.

Deterministic Autoencoders

The decoder is the interesting part!

Decoder ( \(\mathbf Z \rightarrow \mathbf X\) ):

  • In theory, \(\mathbf Z\) shares most of the same information as \(\mathbf X\).

  • A lower dimensional representation is more manageable.

  • The decoder dictates a mapping from \(\mathbf Z\) to \(\mathbf X\)

Any thoughts?

Deterministic Autoencoders

Given an input \(\mathbf z\), we could use that to translate back up to the feature space!

How do we choose \(\mathbf z\) to be viable?

  • Random dart throws?

  • Pick a point in the convex hull of the latent space defined by the training data?

Deterministic Autoencoders

Goal: Come up with a strategy to learn \(P(\mathbf x)\) given a large set of inputs - \(\mathbf X\)

Success:

  • Density estimation: Given a proposed data point, \(\mathbf x_i\), what is the probability with which we could expect to see that data point? Don’t generate data points that have low probability of occurrence!

  • Sampling: How can we generate novel data from the model distribution? We should be able to sample from the distribution!

  • Representation: Can we learn meaningful feature representations from \(\mathbf x\)? Do we have the ability to exaggerate certain features?

Deterministic Autoencoders

This approach doesn’t generate viable images

This isn’t a proper probabilistic model!

  • There isn’t ever an expression of \(P(\mathbf X)\)

  • The method can’t learn how to fill in the gaps between digits appropriately because we never asked it to do that!

We can’t say the probability with which we would expect to see a data point given the data.

We can’t sample from the resulting approximation to \(P(\mathbf X)\) because there isn’t a distribution.

Generative PCA

A probabilistic version of PCA (sort of kinda) is called factor analysis

The construction is pretty similar - \(P\) dimensional feature space to \(K\) dimensional latent space, \(\mathbf Z\)

The difference is that we give structure to the latent space by making a prior assumption

  • Not that consequential since the latent space is to be learned from the data.

  • Just determines what the latent space will look like.

Generative PCA

Most common version - Gaussian factor analysis

\[ f(\mathbf z) = \mathcal N_K(\mathbf z | \mathbf 0 , \mathcal I_K) \]

\[ f(\mathbf x | \mathbf z , \boldsymbol \Theta) = \mathcal N_P(\mathbf x | \mathbf W \mathbf z + \boldsymbol \alpha , \boldsymbol \Psi) \]

where \(\mathbf W\) is a \(P \times K\) matrix of weights, \(\mathbf z\) is a \(K\)-vector of latent values, \(\boldsymbol \alpha\) is an intercept term, and \(\boldsymbol \Psi\) is a \(P \times P\) covariance matrix.
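Ancestral sampling from this model is direct: draw \(\mathbf z\) from the prior, then \(\mathbf x\) from the conditional. A numpy sketch with arbitrary illustrative parameters, checking the standard result that the marginal covariance of \(\mathbf x\) is \(\mathbf W \mathbf W^T + \boldsymbol \Psi\):

```python
import numpy as np

rng = np.random.default_rng(3)

# Ancestral sampling from the Gaussian factor model:
# z ~ N(0, I_K), then x | z ~ N(W z + alpha, Psi).
# W, alpha, Psi below are arbitrary illustrative values.
P, K, n = 4, 2, 100_000
W = rng.normal(size=(P, K))
alpha = np.array([1.0, -1.0, 0.5, 0.0])
Psi = np.diag([0.3, 0.2, 0.4, 0.1])   # diagonal noise covariance

Z = rng.normal(size=(n, K))
X = Z @ W.T + alpha + rng.normal(size=(n, P)) * np.sqrt(np.diag(Psi))

# Marginally, x ~ N(alpha, W W^T + Psi); check empirically.
emp_cov = np.cov(X, rowvar=False)
print(np.abs(emp_cov - (W @ W.T + Psi)).max())
```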

Generative Models

What makes a model generative?

We should be able to provide an answer to the question:

\[ P(\mathbf x | \boldsymbol \Theta) = ? \]

for any viable \(\mathbf x\).

  • A generative model is one where we can answer question about the structure of \(\mathbf x\)

  • Frequently under assumptions, but still can answer it!

Generative Models

Classic example - Logistic Regression vs. QDA

Logistic regression:

\[ P(y = 1 | \mathbf x, \hat{\bb}) = \sigma(\mathbf x^T \hat{\bb}) \]

\[ P(\mathbf x | \hat{\bb}) = \int P(\mathbf x | y , \hat{\bb}) P(y) dy = \; ? \]

  • We never make any assumption about the structure of \(\mathbf x\), so we can’t compute this quantity!

Generative Models

QDA:

Make an assumption that each \(\mathbf x\) conditional on its class label is a random draw from

\[ \mathbf x | y = c \sim \mathcal N_P(\mathbf x | \boldsymbol \mu_c , \boldsymbol \Sigma_c) \]

  • By assumption, we’re able to address the probability with which we see any \(P\)-vector of feature values given a class label

We can also marginalize over the class label by placing a prior on \(y\):

\[P(\mathbf x | \boldsymbol \mu , \boldsymbol \Sigma) = \sum \limits_{c = 1}^C \mathcal N_P(\mathbf x | \boldsymbol \mu_c , \boldsymbol \Sigma_c) P(y = c)\]

where the prior is a vector of probabilities (summing to 1) over the possible class labels.
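A sketch of the QDA density and posterior computations with illustrative parameters (scipy's `multivariate_normal` supplies the class-conditional densities):

```python
import numpy as np
from scipy.stats import multivariate_normal

# QDA as a generative model (illustrative 2-class, 2-d parameters):
# class-conditional Gaussians plus a prior over labels give both
# the marginal P(x) and the posterior P(y | x) via Bayes' rule.
mus = [np.array([0.0, 0.0]), np.array([2.0, 1.0])]
Sigmas = [np.eye(2), np.array([[1.0, 0.5], [0.5, 1.5]])]
prior = np.array([0.7, 0.3])          # P(y = c), sums to 1

x = np.array([1.0, 0.5])              # a proposed data point

# P(x | y = c) for each class
lik = np.array([multivariate_normal(m, S).pdf(x)
                for m, S in zip(mus, Sigmas)])

p_x = np.sum(lik * prior)             # marginal P(x): density estimate
post = lik * prior / p_x              # posterior P(y | x)
print(p_x, post)
```

A discriminative model like logistic regression produces only the analogue of `post`; it has no way to produce `p_x`.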

Generative Models

In non-generative models, we cannot address the probability with which we would see a given feature vector!

  • Not needed for discriminative classification

In fully generative models, we give enough information to address any probability we want. A generative classifier:

\[ P(\mathbf x | \boldsymbol \Theta) = \int P(\mathbf x | \boldsymbol \Theta , y) P(y) dy \]

\[ P(y | \mathbf x , \boldsymbol \Theta) = \frac{P(\mathbf x | y , \boldsymbol \Theta) P(y)}{P(\mathbf x | \boldsymbol \Theta)} \]

  • The cost is assumptions about structure of \(\mathbf x\)

  • If that’s our goal, though, a necessary choice (no free lunch!)

Generative Models

PCA:

\[ \mathbf z = \mathbf W^T \mathbf x \]

\[ \mathbf x = \mathbf W \mathbf z \]

where \(\mathbf W\) is a \(P \times K\) weight matrix with \(K << P\).

Optimal solution under squared reconstruction error is \(\mathbf W = \mathbf Q_K\) where \(\mathbf Q_K\) is the first \(K\) eigenvectors of the covariance matrix

\[ \mathbf X^T \mathbf X = \mathbf Q \mathbf D \mathbf Q^{-1} \]

with \(\mathbf D\) being a diagonal matrix with the eigenvalues sorted from largest to smallest.

Generative Models

Generative goal:

\[ P(\mathbf x | \mathbf W) = \int P(\mathbf x | \mathbf W , \mathbf z) P(\mathbf z) d \mathbf z \]

For PCA, all we know is that \(\mathbf x = \mathbf W \mathbf z\) and that \(\mathbf z = \mathbf W^T \mathbf x\)

DOES NOT COMPUTE!!!

Factor Analysis

Solution: Assumptions. For centered \(\mathbf x\):

\[ P(\mathbf x | \mathbf W , \boldsymbol \Psi, \mathbf z) = \mathcal N_P(\mathbf x | \mathbf W \mathbf z, \boldsymbol \Psi) \]

  • Each \(\mathbf x\) is a random draw from a multivariate normal distribution centered on the linear mapping of the latent vector through the weights - \(\mathbf W \mathbf z\) - with covariance matrix equal to \(\boldsymbol \Psi\)

  • Similar to PCA, just saying that we have some uncertainty in the mapping of \(\mathbf z \rightarrow \mathbf x\)

\[ P(\mathbf z) = \mathcal N_K(\mathbf z | 0 , \mathcal I_K) \]

  • Prior to seeing any data, we believe that each \(\mathbf z\) is a random draw from a standard multivariate normal

  • Inconsequential choice since we’re going to learn \(\mathbf z\) anyways

Factor Analysis

Our goal: Find values of \(\mathbf W\) and \(\boldsymbol \Psi\) that maximize the likelihood with which we would observe our input features given the parameters.

\[ \{ \hat{\mathbf W} , \hat{\boldsymbol \Psi}\} = \underset{\mathbf W , \boldsymbol \Psi}{\text{argmax }} \prod \limits_{i = 1}^N P(\mathbf x_i | \mathbf W , \boldsymbol \Psi) \]

But \(\mathbf x\) depends on the values of the latent variables, \(\mathbf z\), so:

\[ \{ \hat{\mathbf W} , \hat{\boldsymbol \Psi}\} = \underset{\mathbf W , \boldsymbol \Psi}{\text{argmax }} \prod \limits_{i = 1}^N \int \mathcal N_P (\mathbf x_i | \mathbf W \mathbf z , \boldsymbol \Psi) \mathcal N_K(\mathbf z | \mathbf 0 , \mathcal I_K) d\mathbf z \]

  • Integrate over the latent values since we’re learning these
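In the Gaussian case this particular integral actually has a known closed form - the marginal is \(\mathcal N_P(\mathbf x | \mathbf 0, \mathbf W \mathbf W^T + \boldsymbol \Psi)\) - which makes it a nice sanity check for Monte Carlo integration over \(\mathbf z\) (parameters below are arbitrary):

```python
import numpy as np
from scipy.stats import multivariate_normal

rng = np.random.default_rng(4)

P, K = 3, 2
W = rng.normal(size=(P, K))
Psi = np.diag([0.5, 0.3, 0.4])
x = np.array([0.5, -0.2, 1.0])

# Closed form: the FA marginal is N_P(x | 0, W W^T + Psi).
closed = multivariate_normal(np.zeros(P), W @ W.T + Psi).pdf(x)

# Monte Carlo: average N_P(x | W z, Psi) over draws z ~ N_K(0, I).
n = 200_000
Zs = rng.normal(size=(n, K))
Psi_inv = np.linalg.inv(Psi)
norm_const = (2 * np.pi) ** (-P / 2) * np.linalg.det(Psi) ** (-0.5)
diff = x - Zs @ W.T                                   # (n, P)
quad = np.einsum('ij,jk,ik->i', diff, Psi_inv, diff)
mc = (norm_const * np.exp(-0.5 * quad)).mean()
print(closed, mc)
```

For non-Gaussian conditionals no such closed form exists, which is exactly why the optimization below is hard.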

Factor Analysis

Previously, we’ve seen that MLE problems can be made simpler by optimizing the log-likelihood

\[ \{ \hat{\mathbf W} , \hat{\boldsymbol \Psi}\} = \underset{\mathbf W , \boldsymbol \Psi}{\text{argmax }} \sum \limits_{i = 1}^N \log \int \mathcal N_P (\mathbf x_i | \mathbf W \mathbf z , \boldsymbol \Psi) \mathcal N_K(\mathbf z | \mathbf 0 , \mathcal I_K) d\mathbf z \]

Unfortunately, we can’t push the log inside the integral!

  • \(\log(a + b) \neq \log(a) + \log(b)\)

Finding the derivative of a log-integral is not particularly easy…

  • Not even for autodiff…

Factor Analysis

Maximizing this expression is tricky!

What’s holding us back here is that we need to learn \(\mathbf z\) and integrate over it

  • It would be way easier if we knew \(\mathbf z\) beforehand

  • But, \(\mathbf z\) is latent, so we learn it as a function of the data!

Factor Analysis

The Gaussian factor model has many different solution methods:

  • Eigendecomposition (it’s equivalent to PCA with a light twist)

  • MLE via an expectation-maximization routine

We’re going to show a different method that will be extendable for a more general version of factor analysis!

  • A Bayesian method

Bayesian Machine Learning

Why Bayesian Methods?

  • Latent variables have a natural distributional structure

  • “Fills in the gaps” via prior assumptions

  • Everything is a distribution, so everything can be sampled

We’re going to spend some time covering the basics of Bayesian methods and link back to the most common fitting method for generative models for images - variational Bayes

Bayesian Machine Learning

Before we jump into complicated Bayesian methods, we need to understand how Bayesian methods work

Start with alternative solution methods for common problems:

  • IID samples from a normal distribution with known \(\sigma^2\)

  • Linear regression under MSE loss

Bayesian Methods

Recall the setup for the linear regression likelihood approach:

\[Pr(\mathbf y | \mathbf X, \boldsymbol \beta , \sigma^2) = \prod \limits_{i = 1}^N \mathcal N(y_i | \mathbf x_i^T \boldsymbol \beta , \sigma^2)\]

Given the conditional distribution of the outcome given the features and our model parameters, choose the model parameters that maximize the likelihood with which we would observe the data.

  • The idea here is that there is some true value of \(\boldsymbol \beta\) that generates the data

  • Find the one that is most likely to yield the observed outcomes

Bayesian Methods

An alternative way to view the problem:

Let \(\boldsymbol \theta = \{\boldsymbol \beta, \sigma^2\}\) be the collection of unknowns which we would like to learn from the data, \(\mathcal D\).

Prior to observing any data we have some belief about the values of these unknowns that can be represented via a proper probability density function

\[f(\boldsymbol \theta)\]

Bayesian Methods

The goal then is to condition our prior beliefs on the data we’ve observed

\[f(\boldsymbol \theta | \mathcal D)\]

The rational way to do this is to leverage Bayes Theorem

\[f(\boldsymbol \theta | \mathcal D) = \frac{f(\mathcal D | \boldsymbol \theta)f(\boldsymbol \theta)}{f(\mathcal D)}\]

Bayesian Methods

\[f(\boldsymbol \theta | \mathcal D) = \frac{f(\mathcal D | \boldsymbol \theta)f(\boldsymbol \theta)}{f(\mathcal D)}\]

Likelihood

\[f(\mathcal D | \boldsymbol \theta)\]

  • Given some values of the unknowns, what is the probability with which we would observe the data?

  • This is the same likelihood we’ve already seen!

For a linear regression problem:

\[f(\mathbf y | \mathbf X, \boldsymbol \beta, \sigma^2) = \prod \limits_{i = 1}^N \mathcal N(y_i | \mathbf x_i^T \boldsymbol \beta , \sigma^2)\]

Bayesian Methods

\[f(\boldsymbol \theta | \mathcal D) = \frac{f(\mathcal D | \boldsymbol \theta)f(\boldsymbol \theta)}{f(\mathcal D)}\]

Prior

\[f(\boldsymbol \theta)\]

  • Prior to seeing any data, what are our beliefs about the values of the unknowns?

  • Must be a probability density function that integrates to 1

How do we choose a prior distribution?

  • Convenience/Achieve a specific purpose

  • Elicitation (actually use prior knowledge)

  • Low information

Bayesian Methods

For linear regression, we’ll soon see that a convenient choice of prior for the coefficients follows a multivariate normal distribution

Let \(\mathbf z\) be a random vector of length \(P\). We believe that each value of this vector is drawn marginally from a univariate normal distribution. However, each element of \(\mathbf z\) may correlate with other values!

\[\mathbf z \sim \mathcal N_P(\mathbf z | \boldsymbol \mu , \boldsymbol \Sigma)\]

Bayesian Methods

\[f(\mathbf z | \boldsymbol \mu, \boldsymbol \Sigma) = (2 \pi)^{-P/2} \text{det}\left( \boldsymbol \Sigma \right)^{-1/2} \exp \left[- \frac{1}{2} (\mathbf z - \boldsymbol \mu)^T \boldsymbol \Sigma^{-1} (\mathbf z - \boldsymbol \mu) \right]\]

  • \(\boldsymbol \mu\) is a \(P\)-vector of element means; what is the average draw of the \(j^{th}\) element of the vector?

  • \(\boldsymbol \Sigma\) is a \(P \times P\) covariance matrix. The diagonal elements tell us the variance of the different elements of \(\mathbf z\). The off-diagonals tell us the pairwise covariance between different elements of \(\mathbf z\).
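A quick numeric check of the multivariate normal density formula against scipy's reference implementation (the values of \(\boldsymbol \mu\), \(\boldsymbol \Sigma\), and \(\mathbf z\) are illustrative):

```python
import numpy as np
from scipy.stats import multivariate_normal

# The multivariate normal density, implemented from the formula.
def mvn_pdf(z, mu, Sigma):
    P = len(mu)
    diff = z - mu
    return ((2 * np.pi) ** (-P / 2)
            * np.linalg.det(Sigma) ** (-0.5)
            * np.exp(-0.5 * diff @ np.linalg.inv(Sigma) @ diff))

mu = np.array([1.0, -1.0])
Sigma = np.array([[2.0, 0.3], [0.3, 1.0]])
z = np.array([0.5, 0.0])
print(mvn_pdf(z, mu, Sigma), multivariate_normal(mu, Sigma).pdf(z))
```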

A convenient prior is then:

\[\boldsymbol \beta \sim \mathcal N_P(\boldsymbol \beta | \boldsymbol \mu_0 , \boldsymbol \Sigma_0)\]

  • We’ll see why soon.

Bayesian Methods

\[f(\boldsymbol \theta | \mathcal D) = \frac{f(\mathcal D | \boldsymbol \theta)f(\boldsymbol \theta)}{f(\mathcal D)}\]

Marginal Likelihood

\[f(\mathcal D) = \int \limits_{\boldsymbol \theta} f(\mathcal D | \boldsymbol \theta)f(\boldsymbol \theta) d\boldsymbol \theta\]

  • What is the marginal probability with which we would expect to see our data?

  • Think of this as the probability with which we could see the data integrating over our prior

  • How consistent is the data we observe with our prior belief and our choice of model.

  • Higher values mean that, holding the prior constant, our model does a good job of describing the data.

Bayesian Methods

The marginal likelihood is always a constant scalar value!

  • Since we’re integrating away the unknowns, there are no parts of the resulting formula that are random

  • Therefore, this will not end up being a distribution!

  • A useful way to compare models

However, this quantity can be quite difficult to compute.

  • The integral is often not analytically tractable.

Bayesian Methods

\[f(\boldsymbol \theta | \mathcal D) = \frac{f(\mathcal D | \boldsymbol \theta)f(\boldsymbol \theta)}{f(\mathcal D)}\]

Posterior

\[f(\boldsymbol \theta | \mathcal D)\]

  • Conditioning on the data, what is the distribution over our beliefs about the values of the unknowns?

Since our prior is a probability distribution, our posterior beliefs also follow a probability distribution

  • Probability as a belief rather than long run frequency…

Bayesian Methods

The posterior is a proper probability distribution that encodes:

  • The Bayes consistent likelihood with which we expect that any unknown is equal to any value

  • Our uncertainty about the values of the unknowns after observing data!

What is our uncertainty around maximum likelihood estimates?

Bayesian Methods

Let \(\hat{\boldsymbol \theta}\) be a set of maximum likelihood estimates given a likelihood:

\[\hat{\boldsymbol \theta} = \underset{\boldsymbol \theta}{\text{argmax }} \prod \limits_{i = 1}^N Pr(\mathcal D_i | \boldsymbol \theta)\]

Under sufficient regularity conditions, when \(N \rightarrow \infty\) and appealing to CLT:

\[\hat{\boldsymbol \theta} \sim \mathcal N_P(\hat{\boldsymbol \theta} | \boldsymbol \theta , I_N(\boldsymbol \theta)^{-1})\]

where \(I_N(\boldsymbol \theta)\) is the Fisher information:

\[I_N(\boldsymbol \theta) = -E_{\boldsymbol \theta}\left[\frac{d^2}{d \boldsymbol \theta^2} \sum \limits_{i = 1}^N \log Pr(\mathcal D_i | \boldsymbol \theta)\right]\]

Bayesian Methods

In words, the uncertainty around our estimates are asymptotically normal

  • All of the caveats that come from frequentist statistics

  • Large Sample theory

Bayesian methods yield uncertainty as a function of the prior and the data

  • Native to the posterior distribution!

  • We’ll see this via example

Bayesian Methods

Before jumping into regression, let’s look at a simpler example that demonstrates the Bayesian estimation process.

Example: Normal likelihood, known variance

Suppose that we observe \(N\) instances of \(y\) that we believe are i.i.d. draws from a normal distribution with a known value for the variance. \(\mu\) - the mean of the normal - is unknown.

Likelihood:

\[f(\mathbf y | \mu , \sigma^2) = \prod \limits_{i = 1}^N \frac{1}{\sqrt{2 \pi \sigma^2}} \exp \left[ - \frac{1}{2 \sigma^2} (y_i - \mu)^2 \right]\]

Bayesian Methods

Our prior distribution over the unknown, \(\mu\), will be a normal distribution:

\[f(\mu) \sim \mathcal N(\mu | \mu_0 , \sigma^2_0)\]

with some arbitrary mean and variance.

Bayesian Methods

We can define our posterior distribution as:

\[f(\mu | \mathbf y) = \frac{\prod \limits_{i = 1}^N \mathcal N(y_i | \mu , \sigma^2)\mathcal N(\mu | \mu_0 , \sigma^2_0)}{\int \limits_{-\infty}^{\infty} \prod \limits_{i = 1}^N \mathcal N(y_i | \mu , \sigma^2)\mathcal N(\mu | \mu_0 , \sigma^2_0) d\mu}\]

This is our posterior distribution!

  • It’s just not in a useful form

  • What is the mode of the posterior?

  • What is the variance of the posterior?

Bayesian Methods

How do we make posteriors useful?

Four strategies:

  • Analytically find an easy to work with form for the posterior

  • Find the posterior mode and apply a Laplace approximation

  • Take a large number of random draws from the posterior (Not covered in this class, see Gibbs and Metropolis approaches)

  • Find an approximation that maximizes the marginal likelihood

Bayesian Methods

Sometimes, posterior distributions can be analytically simplified

  • Generally only the case when our likelihood and prior are of a certain form

Let’s show an example of this.

Bayesian Methods

\[f(\mu | \mathbf y) = \frac{\prod \limits_{i = 1}^N \mathcal N(y_i | \mu , \sigma^2)\mathcal N(\mu | \mu_0 , \sigma^2_0)}{\int \limits_{-\infty}^{\infty} \prod \limits_{i = 1}^N \mathcal N(y_i | \mu , \sigma^2)\mathcal N(\mu | \mu_0 , \sigma^2_0) d\mu}\]

First simplification:

The denominator is a constant that is not random!

  • We’re integrating over \(\mu\) and everything else is known!

  • Its purpose is to ensure that the posterior integrates to 1.

Bayesian Methods

\[f(\mu | \mathbf y) = \frac{1}{Z} \prod \limits_{i = 1}^N \mathcal N(y_i | \mu , \sigma^2)\mathcal N(\mu | \mu_0 , \sigma^2_0)\]

  • Now, let’s write out the equation for the normals to see if there’s any simplifications we can make

\[f(\mu | \mathbf y) = \frac{1}{Z} \prod \limits_{i = 1}^N \frac{1}{\sqrt{2 \pi \sigma^2}} \exp \left[ - \frac{1}{2 \sigma^2} (y_i - \mu)^2 \right] \frac{1}{\sqrt{2 \pi \sigma_0^2}} \exp \left[ - \frac{1}{2 \sigma_0^2} (\mu - \mu_0)^2 \right]\]

Simplification #2:

We only care about the random terms

  • e.g. terms that have to do with \(\mu\)

  • Anything else can be moved to the front constant

Bayesian Methods

\[f(\mu | \mathbf y) = C \prod \limits_{i = 1}^N \exp \left[ - \frac{1}{2 \sigma^2} (y_i - \mu)^2 \right] \exp \left[ - \frac{1}{2 \sigma_0^2} (\mu - \mu_0)^2 \right]\]

A product of exponentials is the exponential of the sum of the exponents:

\[f(\mu | \mathbf y) = C \exp \left[ - \frac{1}{2 \sigma^2} \sum \limits_{i = 1}^N (y_i - \mu)^2 \right] \exp \left[ - \frac{1}{2 \sigma_0^2} (\mu - \mu_0)^2 \right]\]

Applying the same trick to fold in the prior term:

\[f(\mu | \mathbf y) = C \exp \left[ - \frac{1}{2 \sigma^2} \sum \limits_{i = 1}^N (y_i - \mu)^2 - \frac{1}{2 \sigma_0^2} (\mu - \mu_0)^2 \right]\]

Bayesian Methods

Expanding the squares and distributing the sum:

\[f(\mu | \mathbf y) \propto \exp \left[ - \frac{1}{2 \sigma^2} \left(\sum y_i^2 - 2 \mu \sum y_i + N \mu^2 \right) - \frac{1}{2 \sigma_0^2} \left(\mu^2 - 2 \mu \mu_0 + \mu_0^2 \right) \right]\]

Combining like terms:

\[f(\mu | \mathbf y) \propto \exp \left[- \frac{1}{2} \left(\frac{1}{\sigma^2} \sum y_i^2 + \frac{1}{\sigma^2_0} \mu_0^2 - 2 \mu \left(\frac{1}{\sigma^2}\sum y_i + \frac{1}{\sigma^2_0} \mu_0 \right) + \mu^2 \left(\frac{N}{\sigma^2} + \frac{1}{\sigma^2_0} \right) \right) \right]\]

Now, take advantage of the fact that we can get rid of any multiplicative terms that don’t have to do with \(\mu\)

Bayesian Methods

\[f(\mu | \mathbf y) = C' \exp \left[ - \frac{1}{2} \left(- 2 \mu \left(\frac{1}{\sigma^2}\sum y_i + \frac{1}{\sigma^2_0} \mu_0 \right) + \mu^2 \left(\frac{N}{\sigma^2} + \frac{1}{\sigma^2_0} \right) \right) \right]\]

This looks a lot like the kernel of a normal distribution:

\[C \exp\left[-\frac{1}{2 s^2}(z - m)^2 \right] = C' \exp\left[-\frac{1}{2 s^2}(z^2 - 2 z m) \right]\]

Can we put this in the form of the Gaussian kernel?

Bayesian Methods

The Gaussian identity for random variable \(y\): \[C' \exp\left[-\frac{1}{2}(ay^2 - 2 b y) \right]\]

is equivalent to a normal distribution (up to a multiplicative constant) where:

\[\hat{\sigma}^2 = a^{-1}\]

\[\hat{\mu} = a^{-1}b\]

Bayesian Methods

This means that our posterior is a normal distribution:

\[f(\mu | \mathbf y) \sim \mathcal N(\mu | \hat{\mu} , \hat{\sigma}^2)\]

where

\[\hat{\sigma}^2 = \left( \frac{N}{\sigma^2} + \frac{1}{\sigma^2_0} \right)^{-1}\]

and

\[\hat{\mu} = \left( \frac{N}{\sigma^2} + \frac{1}{\sigma^2_0} \right)^{-1}\left(\frac{1}{\sigma^2}\sum y_i + \frac{1}{\sigma^2_0} \mu_0 \right)\]
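The closed form can be verified numerically: compute likelihood times prior on a dense grid of \(\mu\) values, normalize, and compare moments (all data and prior values below are illustrative):

```python
import numpy as np

rng = np.random.default_rng(5)

# Closed-form normal-normal posterior vs. a brute-force grid
# computation of likelihood x prior.
sigma2 = 1.0                  # known likelihood variance
mu0, sigma2_0 = 0.0, 4.0      # prior mean and variance
y = rng.normal(2.0, np.sqrt(sigma2), size=20)
N = len(y)

# Closed form from the slide
post_var = 1.0 / (N / sigma2 + 1.0 / sigma2_0)
post_mean = post_var * (y.sum() / sigma2 + mu0 / sigma2_0)

# Grid approximation: normalize likelihood * prior numerically
grid = np.linspace(-4.0, 8.0, 200_001)
dx = grid[1] - grid[0]
log_post = (-0.5 / sigma2 * ((y[:, None] - grid[None, :]) ** 2).sum(axis=0)
            - 0.5 / sigma2_0 * (grid - mu0) ** 2)
post = np.exp(log_post - log_post.max())
post /= post.sum() * dx                       # normalize to a density
grid_mean = (grid * post).sum() * dx
grid_var = ((grid - grid_mean) ** 2 * post).sum() * dx
print(post_mean, grid_mean, post_var, grid_var)
```

The grid route works for any one-dimensional posterior; conjugacy is what lets us skip it.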

Bayesian Methods

It’s convenient that we ended up with a normal posterior distribution!

It’s not a coincidence

Conjugate Prior

For certain likelihood forms, the posterior takes the same form as the prior

  • Normal/Normal

  • Beta/Bernoulli

  • Multivariate Normal/Multivariate Normal

Prior form is chosen out of convenience!

Bayesian Methods

Interesting observation:

\[\hat{\mu} = \left( \frac{N}{\sigma^2} + \frac{1}{\sigma^2_0} \right)^{-1}\left(\frac{1}{\sigma^2}\sum y_i + \frac{1}{\sigma^2_0} \mu_0 \right)\]

  • If the prior variance is small (meaning a really precise prior or we have a strong belief about the location of \(\mu\)), the posterior mean will be really close to \(\mu_0\)

  • The numerator will be dominated by the value of the prior mean

Bayesian Methods

Interesting observation:

What happens if we have a prior with infinite variance?

  • All values of \(\mu\) between \(-\infty\) and \(\infty\) are equally likely a priori

\[\hat{\sigma}^2 = \left( \frac{N}{\sigma^2} + \frac{1}{\sigma^2_0} \right)^{-1} = \frac{\sigma^2}{N}\]

and

\[\hat{\mu} = \left( \frac{N}{\sigma^2} + \frac{1}{\sigma^2_0} \right)^{-1}\left(\frac{1}{\sigma^2}\sum y_i + \frac{1}{\sigma^2_0} \mu_0 \right) = \frac{\sigma^2}{N} \frac{\sum y_i}{\sigma^2} = \frac{\sum y_i}{N}\]

The moments of the posterior for the mean become the maximum likelihood estimates: the sample average and its sampling variance (the squared standard error).

Bayesian Methods

We can think of MLE as a special case of Bayesian methods where we have diffuse priors

  • We don’t have any real prior beliefs.

Another interesting result:

As \(N \to \infty\) with a finite-variance prior

\[\hat{\sigma}^2 \approx \frac{\sigma^2}{N}\]

\[\hat{\mu} \approx \frac{\sum y_i}{N}\]

  • At a certain point, our prior doesn’t matter!

  • We can see MLE as the asymptotic limit of the posterior distribution as \(N\) gets large

Bayesian Methods

Bayesian Linear Regression

How is this useful for us?

Let’s now look at the linear regression problem. We want to find the posterior over the regression coefficients given the data we observe and assuming a known variance:

\[f(\boldsymbol \beta | \mathbf X , \mathbf y, \sigma^2)\]

The likelihood:

\[f(\mathbf y | \mathbf X , \boldsymbol \beta , \sigma^2) = \prod \limits_{i = 1}^N \mathcal N(y_i | \mathbf x^T \boldsymbol \beta , \sigma^2)\]

Because this is a product of independent normals, we can represent it as an \(N\)-dimensional multivariate normal distribution:

\[f(\mathbf y | \mathbf X , \boldsymbol \beta , \sigma^2) \sim \mathcal N_N(\mathbf y | \mathbf X \boldsymbol \beta , \sigma^2 \mathcal I_N)\]

Bayesian Linear Regression

The prior:

Let’s say that the prior over the \(P\) coefficients follows a multivariate normal distribution with mean \(\boldsymbol \mu_0\) and covariance matrix \(\boldsymbol \Sigma_0\)

\[f(\boldsymbol \beta) \sim \mathcal N_P(\boldsymbol \beta | \boldsymbol \mu_0 , \boldsymbol \Sigma_0)\]

This means that our posterior is:

\[f(\boldsymbol \beta | \mathbf X , \mathbf y, \sigma^2) = \frac{1}{Z} \mathcal N_N(\mathbf y | \mathbf X \boldsymbol \beta , \sigma^2 \mathcal I_N) \mathcal N_P(\boldsymbol \beta | \boldsymbol \mu_0 , \boldsymbol \Sigma_0)\]

Bayesian Linear Regression

What we know:

  • A multivariate normal distribution (of any dimensionality) is conjugate to a multivariate normal likelihood!

This means that:

\[f(\boldsymbol \beta | \mathbf X , \mathbf y, \sigma^2) \sim \mathcal N_P(\boldsymbol \beta | \hat{\boldsymbol \mu} , \hat{\boldsymbol \Sigma})\]

Bayesian Linear Regression

The proof for this is quite long, but closely follows the normal/normal proof we did above. It is outlined in all its glory in the lecture notes for Bayesian Regression included on the Canvas site.

The punchline:

\[f(\boldsymbol \beta | \mathbf X , \mathbf y, \sigma^2) \sim \mathcal N_P(\boldsymbol \beta | \hat{\boldsymbol \mu} , \hat{\boldsymbol \Sigma})\]

\[\hat{\boldsymbol \Sigma} = \left(\frac{1}{\sigma^2} \mathbf X^T \mathbf X + \boldsymbol \Sigma_0^{-1} \right)^{-1}\]

\[\hat{\boldsymbol \mu} = \hat{\boldsymbol \Sigma}\left(\frac{1}{\sigma^2} \mathbf X^T \mathbf y + \boldsymbol \Sigma^{-1}_0 \boldsymbol \mu_0 \right)\]
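These two formulas translate directly into code. A minimal sketch (the function name, data, and prior settings are hypothetical, chosen for illustration):

```python
import numpy as np

def blr_posterior(X, y, sigma2, mu0, Sigma0):
    """Posterior mean and covariance of beta, assuming known noise variance sigma2."""
    # Sigma_hat = (X^T X / sigma2 + Sigma0^{-1})^{-1}
    Sigma_hat = np.linalg.inv(X.T @ X / sigma2 + np.linalg.inv(Sigma0))
    # mu_hat = Sigma_hat (X^T y / sigma2 + Sigma0^{-1} mu0)
    mu_hat = Sigma_hat @ (X.T @ y / sigma2 + np.linalg.solve(Sigma0, mu0))
    return mu_hat, Sigma_hat

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 3))
beta_true = np.array([1.0, -2.0, 0.5])
y = X @ beta_true + rng.normal(0.0, 1.0, size=200)

# Weak prior centered at zero: posterior mean should land near beta_true
mu_hat, Sigma_hat = blr_posterior(X, y, 1.0, np.zeros(3), 10.0 * np.eye(3))
print(mu_hat)
```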

Bayesian Linear Regression

Implication 1:

When the prior is diffuse (\(\boldsymbol \Sigma_0^{-1} \to \mathbf 0\))

\[\hat{\boldsymbol \mu} = \left(\frac{1}{\sigma^2} \mathbf X^T \mathbf X \right)^{-1}\left(\frac{1}{\sigma^2} \mathbf X^T \mathbf y \right) = (\mathbf X^T \mathbf X)^{-1}\mathbf X^T \mathbf y\]

  • OLS is the solution when we have no prior info!

  • This also holds when the number of training samples goes to \(\infty\)

Bayesian Linear Regression

Implication 2:

Let’s set \(\boldsymbol \mu_0 = \mathbf 0\) and \(\boldsymbol \Sigma_0 = \tau^2 \mathcal I_P\)

This means that:

\[\hat{\boldsymbol \Sigma} = \left(\frac{1}{\sigma^2} \mathbf X^T \mathbf X + \frac{1}{\tau^2} \mathcal I_P \right)^{-1}\]

\[\hat{\boldsymbol \mu} = \hat{\boldsymbol \Sigma}\left(\frac{1}{\sigma^2} \mathbf X^T \mathbf y + \frac{1}{\tau^2} \boldsymbol \mu_0 \right) = \frac{1}{\sigma^2}\hat{\boldsymbol \Sigma} \mathbf X^T \mathbf y\]

since \(\boldsymbol \mu_0 = \mathbf 0\).

Bayesian Linear Regression

Let \(\lambda = \frac{\sigma^2}{\tau^2}\). We can rearrange the posterior mean to be:

\[\hat{\boldsymbol \mu} = (\mathbf X^T \mathbf X + \lambda \mathcal I_P)^{-1} \mathbf X^T \mathbf y\]

which is the analytical solution for Ridge regression

Regularization with the L2 norm is Bayesian regression with a normal prior on the coefficients centered at 0!
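The equivalence can be verified numerically: the Bayesian posterior mean with \(\boldsymbol \mu_0 = \mathbf 0\), \(\boldsymbol \Sigma_0 = \tau^2 \mathcal I_P\) matches the Ridge solution with \(\lambda = \sigma^2/\tau^2\). A short sketch (the data and variance values are hypothetical):

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 4))
y = X @ np.array([2.0, 0.0, -1.0, 0.5]) + rng.normal(size=100)

sigma2, tau2 = 1.0, 0.5
lam = sigma2 / tau2  # Ridge penalty implied by the prior

# Bayesian posterior mean with mu0 = 0, Sigma0 = tau2 * I
Sigma_hat = np.linalg.inv(X.T @ X / sigma2 + np.eye(4) / tau2)
mu_hat = Sigma_hat @ (X.T @ y / sigma2)

# Analytical Ridge solution
beta_ridge = np.linalg.solve(X.T @ X + lam * np.eye(4), X.T @ y)

print(mu_hat)
print(beta_ridge)  # the two coincide up to numerical precision
```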

Bayesian Linear Regression

Regularization methods are us setting up desired or believed values for the regression coefficients

  • We want the regression coefficients to be closer to zero to better generalize!

The L2 penalty (and therefore weight decay) is a Bayesian solution to the generalization problem

  • Inject our prior desire to find a solution set that is simpler (e.g. lower norm of the coefficients)

The strength of this regularization is then the level at which we want to simplify our results

  • Set this in accordance with our goal to generalize as well as possible

Bayesian Linear Regression

The Bayesian approach also justifies letting the model be more complex as \(N\) grows

  • Our prior desire for simplicity doesn’t matter as much when \(N\) gets large, since the regularized solution converges to the unpenalized MLE/ERM solution.

I find this to be a satisfying justification for regularization!

Next Time

MAP approximations, Marginal Likelihood maximization, and Variational Bayes